Introduction

In this analysis, I have adapted a methodology originally used to predict the occurrence of crime to predict 311 calls for street trees. It is likely that selection bias plays a similar role in this dynamic as it does in predictive policing, for a number of reasons. These include the fact that some philly residents are unaware of or unable to access the 311 service due to technological or language barriers, whereas others might be ‘power users’ of the service who might over-use the service by submitting multiple requests for a single event. Additionally, residents in wealthier neighborhoods might be more concerned about the aesthetic appeal of street trees, while a resident in a less affluent area might be more concerned about basic services. Finally, factors related to the trees themselves, such as size or presence of infrastructure features like sewer inlets or aboveground wires can lead to higher tree mortality rates or higher rates of trees coming into conflict with infrastructure.These factors and others could lead to a disproportionate number of requests from certain areas or demographics, causing those areas of the city to be under- or over-represented in the data, independent of the actual distribution of street tree work needed.

The purpose of adapting this predictive policing algorithm to predict Street Tree 311 calls is to facilitate preventative care and avoid property damage from fallen trees. It must be said that this theoretical care is highly aspirational, as there are currently thousands of tree work calls waiting to be resolved. However, if the city were able to clear the backlog of tree maintenance requests, this tool could be used to focus preventative tree care efforts in areas most likely to experience 311 calls about street trees. After all, as the famous Ben Franklin quote goes, ‘an ounce of prevention is worth a pound of cure.’

Setup

Clear Environment

Load Libraries

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
library(tigris)
library(arcpullr)
library(units)
library(zip)
library(rgdal)
library(osmdata)

Data Wrangling

# # 311 Requests API (needs work)
# # StreetTreeRequests_ALL <- get_spatial_layer("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/philly311__public_cases/FeatureServer/0/") %>%
# #   filter(service_name == 'Street Trees') %>%
# #   filter(lat != "") %>%
# #   st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
# #   st_transform(crs = coordinate_system)
#
# # 311 Requests (by year)
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2022.csv") %>%
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2021.csv") %>%
# StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2019.csv") %>%
# # StreetTreeRequests <- st_read("/Users/oliveratwood/Downloads/Philly 311/public_cases_fc2018.csv") %>%
#   filter(service_name == 'Street Trees') %>%
#   filter(lat != "") %>%
#   st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
#   st_transform(crs = coordinate_system)
#
# # Specify the file path where you want to save the GeoJSON file
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2022.geojson"
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2021.geojson"
# file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2019.geojson"
# # file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/StreetTreeRequests2018.geojson"
#
# # Write the data frame as GeoJSON
# st_write(StreetTreeRequests, file_path)
#
# LOAD +WRITE TREE INVENTORY
# trees <- get_spatial_layer("https://services.arcgis.com/fLeGjb7u4uXqeF9q/arcgis/rest/services/PPR_Tree_Inventory_2022/FeatureServer/0/") %>%
#   filter(TREE_DBH > 0, TREE_DBH < 100) %>%
#   st_transform(crs = coordinate_system) %>%
#   st_join(philly_boundary) %>%
#   mutate(Legend = "PhillyTrees") %>%
#   mutate(Legend2 = "DBH")
#
# # Write the data frame as GeoJSON
# file_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/PPR_Tree_Inventory_OA.geojson"
# st_write(trees, file_path)
#
# # Compress the file using gzip
# system(paste("gzip", file_path))

Set Parameters

# Coordinate System
coordinate_system <- 2272

# # Resolution
resolution <- 600

palette_sequential <- c("#AF2E1B", "#CC6324", "#BFA07A", "#E2CEBA", "white")
palette_diverging <- c("#AF2E1B", "white", "#3B4B59")

# # #Load Boundary File
# query <- opq("Philadelphia, PA, USA") %>%
#    add_osm_feature("boundary", "administrative")
# 
# philly_boundary <- osmdata_sf(query)
# philly_boundary <- philly_boundary$osm_polygons
# philly_boundary <- philly_boundary %>% st_transform(crs = coordinate_system)
# 
# #Load Boundary File
# philly_boundary <- counties(state = 42) %>%
#   filter(NAME == "Philadelphia")

#PHILLYBOUNDARY
philly_boundary <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/City_Limits.geojson") %>%
  st_transform(crs = 2272)

Load data for independent variables

#TREE INVENTORY
# Load and Unzip tree inventory
file_path_gz <- "https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/PPR_Tree_Inventory_OA.geojson.gz" # Unzipping and reading the content
con <- gzcon(file(file_path_gz, "rb"))
geojson_content <- readLines(con)
close(con)

temp_file <- tempfile(fileext = ".geojson")
writeLines(geojson_content, temp_file) # Write uncompressed content to a temporary file
trees <- st_read(temp_file, quiet = TRUE) %>%  # Read the GeoJSON from the temporary file
  st_transform(crs = 2272)
unlink(temp_file) # Delete the temporary file after reading it

# ACS DATA
#2018 for Philly
tracts2018 <-
  get_acs(geography = "tract",
          variables = c("B25026_001E",
                        "B19013_001E",
                        "DP04_0046PE"),
          year=2018, state=42, county="101",
          geometry=TRUE, output="wide") %>%
  rename(TotalPop = B25026_001E,
         MedHHInc = B19013_001E,
         HomeOwnRate = DP04_0046PE) %>%
  mutate(area = st_area(geometry)) %>%
  mutate(year = "2018",
         PopDensity = drop_units(TotalPop / (area / 2.788e+7))) %>%
  mutate(HomeOwnRate = ifelse(is.na(HomeOwnRate), 0, HomeOwnRate)) %>%
  mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>%
  dplyr::select(-NAME, -TotalPop, -area, -starts_with("D"), -starts_with("B"))%>%
  st_transform(crs = 2272)

# NEIGHBORHOODS
neighborhoods <- get_spatial_layer("https://services1.arcgis.com/a6oRSxEw6eIY5Zfb/ArcGIS/rest/services/Philadelphia_Neighborhoods/FeatureServer/0")%>%
  st_transform(crs = 2272)

# Points
#VACANT LOTS
gdb_path <- "/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Midterm/data/Features4PPA.gdb"
layers <- ogrListLayers(gdb_path)
Vacant_Lots <- sf::st_read(gdb_path, layer = "VacantLots")%>%
  st_transform(crs = 2272) %>%
  mutate(Legend = 'VacantLots') %>%
  dplyr::select(Legend) %>%
  st_centroid(Vacant_Lots)

#SEWER INLETS
SewerInlets <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/PhillyWater_INLETS/PhillyWater_INLETS.shp") %>%
  st_transform(crs = 2272) %>%
  dplyr::filter(SYMBOLGROU == "Inlet") %>%
  mutate(Legend = 'SewerInlets') %>%
  dplyr::select(Legend)

# Lines
#HISTORIC STREAMS
HistoricStreams <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/HistoricStreams_Arc-shp/0e0b38fd-a993-4007-88a3-0373a0035f77202041-1-vo11h2.z0onl.shp")%>%
  st_transform(crs = 2272) %>%
  mutate(Legend = 'HistoricStreams') %>%
  dplyr::select(Legend)

#CURBLINES
curblines <- st_read("/Users/oliveratwood/Documents/GitHub/musa_5080_2023/Week_6/PredictiveTreeing/Data/curblines/curblines.shp")%>%
  st_transform(crs = 2272) %>%
  mutate(Legend = 'curblines') %>%
  dplyr::select(Legend)

Load in Selected 311 Data

# Load Street Tree 311 Requests (previously cleaned, see commented code above)
StreetTreeRequests <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/StreetTreeRequests2018.geojson")%>%
  st_transform(crs = 2272)

Preliminary Analysis

A map of the outcome of interest

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() +
  geom_sf(data = philly_boundary) +
  geom_sf(data = StreetTreeRequests, colour="orange", size=0.1, show.legend = "point") +
  labs(title= "Tree 311 Calls, Philadelphia - 2022") +
  mapTheme(title_size = 10),

ggplot() +
  geom_sf(data = philly_boundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(StreetTreeRequests)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Tree 311 Calls") +
  mapTheme(title_size = 10) + theme(legend.position = "none"))


These maps show how street tree 311 calls are distributed across the city. It would appear that there are some hotspots.

## using {sf} to create the grid
## Note the `.[philly_boundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(philly_boundary,
               cellsize = 1000, 
               square = TRUE) %>%
               # hexagon = TRUE) %>% #option for a hexagonal grid
  .[philly_boundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

A map of the outcome of interest joined to the fishnet

## add a value of 1 to each tree request, sum them with aggregate
tree_net <- 
  dplyr::select(StreetTreeRequests) %>% 
  mutate(countRequests = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRequests = replace_na(countRequests, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = tree_net, aes(fill = countRequests), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Street Tree Service Requests for the Fishnet") +
  mapTheme()

Joining the count of street tree 311 requests to the fishnet highlights these hotspots, where single cells are bright with a very high number of requests! Could this be indicative of actual concentrations of tree work needing to be done, or is it rather just a small handful of particularly persistent citizens?

Feature Engineering

Nearest Neighbor Features

## NN from trees
vars_net <- fishnet %>%
    mutate(trees.nn = nn_function(st_coordinates(st_centroid(fishnet)), 
                                           st_coordinates(trees),
                                           k = 5))
## Join NN feature to our fishnet
vars_net <- left_join(tree_net, st_drop_geometry(vars_net), by="uniqueID") 


# ## NN from vacant lots
# vars_net <- vars_net %>%
#     mutate(Vacant_Lots.nn = nn_function(st_coordinates(st_centroid(fishnet)),
#                                            st_coordinates(Vacant_Lots),
#                                            k = 5))
# ## Join NN feature to our fishnet
# vars_net <- left_join(tree_net, st_drop_geometry(vars_net), by="uniqueID")

Joining Spatial Features

vars_net <- trees %>%
  st_join(vars_net, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n(), 
            sum_TREE_DBH = sum(TREE_DBH, na.rm = TRUE), # Calculate sum of TREE_DBH
            mean_TREE_DBH = mean(TREE_DBH, na.rm = TRUE)) %>% # Calculate sum of TREE_DBH
  left_join(vars_net, ., by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>% 
  dplyr::select(-`<NA>`) %>%
  ungroup() %>% 
  mutate(sum_TREE_DBH = ifelse(is.na(sum_TREE_DBH), 0, sum_TREE_DBH)) %>% 
  mutate(mean_TREE_DBH = ifelse(is.na(mean_TREE_DBH), 0, mean_TREE_DBH))  # Replace NA values with 0


vars_net <- SewerInlets %>% 
  st_join(vars_net, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% # Calculate count of trees
  left_join(vars_net, ., by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>% 
  dplyr::select(-`<NA>`) %>%
  ungroup()

vars_net <- Vacant_Lots %>% 
  st_join(vars_net, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% # Calculate count of trees
  left_join(vars_net, ., by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>% 
  dplyr::select(-`<NA>`) %>%
  ungroup()


# ##JOIN CURBS
# # Calculate intersections
# intersections <- st_intersection(vars_net, curblines)
# 
# # Count intersections by fishnet uniqueID
# intersection_counts <- intersections %>%
#   group_by(uniqueID) %>%
#   summarize(count = n()) %>%
#   ungroup()
# 
# # Join counts back to the original fishnet (if needed)
# vars_net <- st_join(vars_net, intersection_counts, by = "uniqueID") %>% 
#   mutate(curbcount = ifelse(is.na(count), 0, count)) %>% 
#   dplyr::select(-count, -uniqueID.x, -uniqueID.y) %>%
#   mutate(uniqueID = row_number())


# ##JOIN HISTORIC STREAMS
# # Calculate intersections
# intersections <- st_intersection(vars_net, HistoricStreams)
# 
# # Count intersections by fishnet uniqueID
# intersection_counts <- intersections %>%
#   group_by(uniqueID) %>%
#   summarize(count = n()) %>%
#   ungroup()
# 
# # Join counts back to the original fishnet (if needed)
# vars_net <- st_join(vars_net, intersection_counts, by = "uniqueID") %>% 
#   mutate(H_Streams = ifelse(is.na(count), 0, count)) %>% 
#   dplyr::select(-count, -uniqueID.x, -uniqueID.y) %>%
#   mutate(uniqueID = row_number())

Joining areal data

final_net <- st_centroid(vars_net) %>%
    st_join(dplyr::select(tracts2018, MedHHInc), by = "uniqueID") %>%
    st_join(dplyr::select(tracts2018, PopDensity), by = "uniqueID") %>%
    st_join(dplyr::select(tracts2018, HomeOwnRate), by = "uniqueID") %>%
    st_join(dplyr::select(neighborhoods, NAME), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(vars_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

final_net <- final_net %>% 
  rename(neighborhoods = NAME)

# for live demo
# mapview::mapview(final_net, zcol = "District")

Assessing Independent Variables

A small multiple map of risk factors in the fishnet.

final_net.long <- final_net %>% 
  dplyr::select(-neighborhoods, -cvID)

final_net.long <- gather(final_net.long, Variable, value, -geometry, -uniqueID, -countRequests)
vars <- unique(final_net.long$Variable)

vars <- unique(final_net.long$Variable)
mapList <- list()

for(i in seq_along(vars)){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.long, Variable == vars[i]), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=vars[i]) +
      mapTheme()
}


do.call(grid.arrange,c(mapList, ncol=3, top="Influencing Factors by Fishnet"))


This map reveals some of these features have similar spatial distribution to street tree 311 calls.

Calculate Local Moran’s I

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# print(final_net.weights, zero.policy=TRUE)
local_morans <- localmoran(final_net$countRequests, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(TreeCount = PhillyTrees, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

A small multiple scatterplot with correlations.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -neighborhoods) %>%
    gather(Variable, Value, -countRequests)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countRequests, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countRequests)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Street Tree 311 Calls as a function of risk factors") +
  plotTheme()


These scatterplots suggest a number of things about the relationship between street tree 311 calls and our factors of interest. We see some strong positive correlations between popDensity, PhillyTrees, SewerInlets, and sum_TREE_DBH, suggesting that the people, more trees, stormwater infrastructure, and the bigger the trees of a given area, the more likely that area is to place street tree 311 calls.

A histogram of the dependent variable.

ggplot(final_net, aes(x=countRequests)) +
  geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.7) + # You can adjust binwidth as needed
  labs(title="Histogram of Street Tree 311 Calls",
       x="Number of Calls",
       y="Frequency") +
  theme_minimal()


This histogram shows that across the city, most fishnet cells do not have any calls, and there is a steady dropoff from cells with lower numbers of calls to those with higher numbers of calls. Notable here is that at the high end, there are increases in the frequency of calls, which suggests some areas have people who are ‘frequent callers’ or trees who are ‘frequent fallers.’ Perhaps we should remove them?

Modeling and Cross-Validation

Choose Variables

# Version 1
# reg.vars <- c("trees.nn", "sum_TREE_DBH", "mean_TREE_DBH", "PhillyTrees",
#               "SewerInlets", "VacantLots", "curbcount",
#               "H_Streams", "MedHHInc", "PopDensity")
# 
# reg.ss.vars <- c("trees.nn", "sum_TREE_DBH", "mean_TREE_DBH", "PhillyTrees",
#                   "SewerInlets", "VacantLots", "curbcount",
#                   "H_Streams", "MedHHInc", "PopDensity", "treecall.isSig", "treecall.isSig.dist")

#  Version 2
# reg.vars <- c("trees.nn", "sum_TREE_DBH", "PhillyTrees",
#               "SewerInlets", "VacantLots",
#               "H_Streams", "MedHHInc", "PopDensity")
# 
# reg.ss.vars <- c("trees.nn", "sum_TREE_DBH", "PhillyTrees",
#                   "SewerInlets", "VacantLots",
#                   "H_Streams", "MedHHInc", "PopDensity", "treecall.isSig", "treecall.isSig.dist")

#  Version 3
# reg.vars <- c("PhillyTrees", "sum_TREE_DBH", "SewerInlets", "PopDensity", "HomeOwnRate")
# 
# reg.ss.vars <- c("PhillyTrees", "sum_TREE_DBH", "SewerInlets", "PopDensity", "HomeOwnRate",
#                  "treecall.isSig", "treecall.isSig.dist")

 # Version 4
reg.vars <- c("sum_TREE_DBH", "SewerInlets", "PopDensity")

reg.ss.vars <- c("sum_TREE_DBH", "SewerInlets", "PopDensity",
                 "treecall.isSig", "treecall.isSig.dist")


#  Version 5
# reg.vars <- c("PhillyTrees", "PopDensity")
# 
# reg.ss.vars <- c("PhillyTrees", "SewerInlets", "PopDensity",
#                  "treecall.isSig", "treecall.isSig.dist")

‘Leave One Group Out’ Cross-Validation on spatial features

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRequests",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countRequests, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRequests",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countRequests, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "neighborhoods",
  dependentVariable = "countRequests",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = neighborhoods, countRequests, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "neighborhoods",
  dependentVariable = "countRequests",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = neighborhoods, countRequests, Prediction, geometry)


Cross-validation is a technique used to assess how well a model will generalize to an independent dataset. It involves partitioning the original dataset into a training set to train the model and a test set to evaluate it.

A table of Mean Absolute Error and standard deviation Mean Absolute Error for each model

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countRequests,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countRequests,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countRequests,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countRequests,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf()
error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRequests, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()


st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(0, color = "black", background = "white") %>%
    row_spec(1, color = "black", background = "grey") %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(3, color = "black", background = "grey") %>%
    row_spec(4, color = "black", background = "#FDE725FF") 
Regression Mean_MAE SD_MAE
Random k-fold CV: Just Risk Factors 0.33 0.28
Random k-fold CV: Spatial Process 0.32 0.27
Spatial LOGO-CV: Just Risk Factors 0.78 0.90
Spatial LOGO-CV: Spatial Process 0.76 0.90

These are some pretty low MAE values. How do they break down across differing demographics?

Table of raw errors by race context for a random k-fold vs. spatial cross validation regression

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=42, county="101", geometry=T) %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  st_transform(crs = 2272) %>%
  .[neighborhoods,]

# Create the table
error_by_reg_and_race <- 
  reg.summary %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
    st_drop_geometry() %>%
    group_by(Regression, raceContext) %>%
    summarize(mean.Error = mean(Error, na.rm = T)) %>%
    spread(raceContext, mean.Error)

# Display the table
error_by_reg_and_race %>%
  kable(caption = "Mean Error by Regression Type and Neighborhood Racial Context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0, color = "black", background = "white") %>%
  row_spec(1, color = "black", background = "grey") %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(3, color = "black", background = "grey") %>%
  row_spec(4, color = "black", background = "#FDE725FF")

This table tells us a few things: first, all four of our models are worse at predicting the likelihood of a street tree-related 311 call in majority non-white neighborhoods than majority white neighborhoods, both for the risk factors alone and when considering the spatial processes. Second, the model underestimates the number of street tree 311 calls for majority non-white neighborhoods and slightly overestimates the number of calls for majority white neighborhoods. Lastly, the inclusion of spatial processes marginally improves the accuracy of the model in both majority white and non-white neighborhoods.
We must keep in mind, however, that these errors are means of all the errors across the city, so it may be the case that the positive and negative errors (over- and under-estimation of street tree 311 calls) are cancelling each other out to some degree. Let’s make a map to investigate this.

A Multiple map of model errors by random k-fold and spatial cross validation

# Create the small multiple map with the midpoint set at 0
ggplot(data = reg.summary) +
  geom_sf(aes(fill = Error), color = NA) +
  facet_wrap(~ Regression, ncol = 2) +
  theme_minimal() +
  labs(title = "Model Errors by Random K-Fold and Spatial Cross Validation",
       fill = "Error") +
  scale_fill_gradient2(low = palette_diverging[1], 
                       mid = palette_diverging[2], 
                       high = palette_diverging[3], 
                       midpoint = 0) +
  theme(legend.position = "bottom")

In this multiple map, positive errors (over-predictions) are shown as red and negative errors (under-predictions) are shown as blue.
The highest numbers of over-predictions appear to occur downtown, whereas under-predictions are more evenly distributed across the city.
Since there are lots of errors on either side of zero, our low mean error is probably not the best measure of model accuracy.

Testing Predictions

Get 2019 Street Tree 311 call data

StreetTreeRequests19 <- st_read("https://raw.githubusercontent.com/olivegardener/musa_5080_2023/main/Week_6/PredictiveTreeing/Data/StreetTreeRequests2019.geojson") %>%
  .[fishnet,]

Density vs predictions

TreeCall_ppp <- as.ppp(st_coordinates(StreetTreeRequests19), W = st_bbox(final_net))
TreeCall.1000 <- density.ppp(TreeCall_ppp, resolution)

tree_KDE_sum <- as.data.frame(TreeCall.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(tree_KDE_sum$value, 
                             n = 5, "fisher")
tree_KDE_sf <- tree_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(StreetTreeRequests19) %>% mutate(TreeCallCount = 1), ., sum) %>%
    mutate(TreeCallCount = replace_na(TreeCallCount, 0))) %>%
  dplyr::select(label, Risk_Category, TreeCallCount)
# reg.cv
# reg.ss.cv
# reg.spatialCV
# reg.ss.spatialCV

ml_breaks <- classIntervals(reg.ss.cv$Prediction, 
                             n = 5, "fisher")

tree_risk_sf <- reg.ss.cv %>%
  mutate(label = "Risk Predictions",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(StreetTreeRequests19) %>% mutate(TreeCallCount = 1), ., sum) %>%
      mutate(TreeCallCount = replace_na(TreeCallCount, 0))) %>%
  dplyr::select(label,Risk_Category, TreeCallCount)

Map comparing kernel density to risk predictions for the next year’s crime

rbind(tree_KDE_sf, tree_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    # geom_sf(data = sample_n(StreetTreeRequests19, 3000), size = .1, colour = "black", alpha = .1) +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2019 Street Tree 311 Calls") +
    mapTheme(title_size = 14)

These two maps indicate that our predicted calls are more spread out across the city than the actual density for 2019. Higher risk areas appear to be under-predicted and lower risk areas appear to be over-predicted.

Bar plot comparing kernel density to risk predictions for the next year’s crime

rbind(tree_KDE_sf, tree_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countTreeCalls = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_calls = countTreeCalls / sum(countTreeCalls)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_calls)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, name = "Model") +
      labs(title = "Call prediction vs. Kernel density, 2019 Street Tree 311 Calls",
           y = "% of Test Set Calls (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))


This bar chart confirms this suspicion.


Conclusion


I would not recommend this algorithm be used to predict where Street Tree 311 calls will be placed. The algorithm, when trained on 2018 data, failed to predict calls for 2019 data with acceptable accuracy. Though the average percent error was small, the map and bar chart above indicate that the algorithm over-estimates risk in the lower risk areas and underestimates risk in higher risk areas. This is the opposite of what one would want from such an algorithm.

Further undermining the effectiveness of this model is the fact that it underestimates the number of street tree 311 calls for majority non-white neighborhoods and overestimates the number of calls for majority white neighborhoods. In the interest of making preventative tree care more equitable, one would want to refine the model to produce predictions that are more evenly accurate across demographics. This could be done through investigation of a wider range of independent variables to improve prediction, further refinement of the feature engineering process for these variables, setting a limit to the number of calls per cell so certain areas with the ‘frequent caller/frequent faller’ phenomenon skew the training of the algorithm less, among other strategies.

Street trees are a vital part of Philadelphia’s resilience infrastructure. A tool such as this, once properly refined, could save the city millions of dollars per year in emergency maintenance costs by better enabling preventative care. Carpe tree-em!